Statistics Saturdays: Data Types, Averages, and Common Lies
“Average” is probably the most misused statistical word, right up there with “Probability” and “Hefeweizen”. Today’s statistical post is going to be about the three types of data, and the three types of averages, and the ways they interact. It’ll be nifty!
Types of Data
Nominal: Data where you can’t rank it, you can’t order it, it just sits there uniquely, glowering at you. Examples of nominal data: Zip codes, names of birds, ethnicities, genders. Each of these is only meaningful in and of itself: You can’t, unless you’re some sort of dick, rank these sorts of things.
Ordinal: You can rank these things, but the differences between the ranks are not meaningful. So, if you’re rating some restaurant food using the ranks “Not at all spicy”, “Bland”, “A little spicy”, “Quite spicy”, and “Flaming hornets in my mouth”, that’s an example of ordinal level data. Educational level is another example: Pre-high school, high school degree, some college, BA, MA, PhD. You can rank these, but there is no easy way to describe the difference between one level and the next, nor are the intervals meaningful. The final kind of ordinal level data is interval/ratio data that has been transformed into ordinal. If you take salaries and bin them into groups of $10,000, so you have people making $1-10,000, $10,001-20,000, etc., then even though the data is numerical, each value is now a range rather than a number, so it is no longer interval/ratio; it is ordinal.
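To make the salary binning concrete, here is a minimal sketch in Python. The function name salary_bin and the sample salaries are made up for illustration, and it assumes salaries are whole dollars of at least $1.

```python
def salary_bin(salary, width=10_000):
    """Return the ordinal bin label for a whole-dollar salary."""
    if salary < 1:
        raise ValueError("salary must be at least 1")
    index = (salary - 1) // width      # 1..10,000 -> 0, 10,001..20,000 -> 1
    low = index * width + 1
    high = (index + 1) * width
    return f"{low:,}-{high:,}"

# Made-up salaries, purely for illustration.
salaries = [8_500, 10_000, 10_001, 34_250]
bins = [salary_bin(s) for s in salaries]
print(bins)   # ['1-10,000', '1-10,000', '10,001-20,000', '30,001-40,000']
```

Once the salaries are labels like these, you can still rank the bins, but you can no longer tell how far apart two people inside the same bin are.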
Interval/Ratio: As the name indicates, this is actually two types of data, but they’re so closely related they’re usually treated as the same thing. Basically, this is all true numeric data. Your height is interval/ratio. The number of times you’ve high-fived Van Halen is interval/ratio. The number of jobs added to the economy in a given month is interval/ratio. The difference between ‘interval’ and ‘ratio’ is simply that ratio data has a meaningful zero. Temperature in Fahrenheit is interval data: the gap between 40 and 50 degrees is the same size as the gap between 50 and 60, but zero degrees doesn’t mean ‘no temperature’, so 80 degrees isn’t ‘twice as hot’ as 40. The number of cows on a farm is ratio data: zero cows really means no cows, and ten cows is twice as many as five. Whether the numbers are whole or fractional is a separate question. You can’t have .5 of a cow except in the weirder dimensions, so cow counts are discrete. You can, however, have .5 of a gallon of milk, so the milk output of the cows is continuous. Since we can perform most of the same mathematical operations on interval or ratio data, it generally gets treated as the same thing.
Averages
There are three words which can be used to express an ‘average’: Mode, median, and mean.
The mode is simply the most common reply in a data set. The mode is a possible measurement for all data types. Nominal data can only be measured by the mode—mean and median are meaningless for nominal data. If you take a poll of who likes what sports team (nominal level data), and three people like the New Jack Whackatattack, two people like the Dallas Cowtippers, and four people are really into the Florida Men, you can say that the data shows a mode of “Florida Men”.
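As a quick sketch, the sports team poll can be tallied in Python with collections.Counter from the standard library. The variable names are just for illustration.

```python
from collections import Counter

# One entry per person polled, using the counts from the example.
votes = (["New Jack Whackatattack"] * 3
         + ["Dallas Cowtippers"] * 2
         + ["Florida Men"] * 4)

counts = Counter(votes)
mode_team, mode_count = counts.most_common(1)[0]
print(mode_team, mode_count)   # Florida Men 4
```

Note that if two teams tie for first place, most_common(1) will only show you one of them, so it is worth looking at the full tally before reporting a mode.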
Sometimes, the mode is a very meaningful measurement. For nominal data, it is the only possible measurement, but that doesn’t automatically mean it’s actually meaningful. If you poll 100 people, and get 99 different answers but two people have the same answer, their answer is the ‘mode’, but obviously this is not a meaningful result from the data.
The ‘mode’ of ordinal data can be useful to report, too. If you poll people on their opinion on a new building going up, and 20% of the people report they’re against it, 20% report they’re for it, and 60% report that they simply couldn’t be bothered to give the slightest iota of a shit about it, then “couldn’t be bothered” is the modal answer, and it is significant.
Median: The median is the response in the middle of the data set once the responses are put in rank order. This is most used for ordinal level data, though it is also meaningful for quite a lot of interval/ratio data, especially when that data is skewed. If you have a data set for how happy people are with their chandelier-polishing service, with categories of “Very satisfied”, “Satisfied”, “Neutral”, “Dissatisfied”, and “You’re dead to me”, and you have 3 “VS”, 4 “S”, 2 “N”, 1 “D”, and 1 “YDTM”, for a total of 11 responses, you find the middle value. In this case, the sixth value has five responses on either side, so you count in to the sixth value from either end. That gives us “S”, “Satisfied”, as the median response.
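Here is a small Python sketch of the same count-to-the-middle procedure for the chandelier-polishing survey. It assumes an odd number of responses, as in the example, and the variable names are only illustrative.

```python
# Categories in rank order, from happiest to unhappiest.
scale = ["Very satisfied", "Satisfied", "Neutral",
         "Dissatisfied", "You're dead to me"]
counts = {"Very satisfied": 3, "Satisfied": 4, "Neutral": 2,
          "Dissatisfied": 1, "You're dead to me": 1}

# Expand the counts into one ranked list with one entry per response.
ranked = [label for label in scale for _ in range(counts[label])]

middle = len(ranked) // 2          # index 5 of 11, i.e. the sixth response
median_response = ranked[middle]
print(median_response)             # Satisfied
```

With an even number of responses there would be two middle values, and for ordinal data you can’t average them, so you would have to report both or pick a convention.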
Mean: The mean is the ‘true’ or mathematical average, derived by adding all the values together and dividing by N, the number of responses in the set. So if Johnny, Billy, Luke, and Fothersby make $2, $5, $7, and $5 an hour mowing lawns, their mean rate is (2 + 5 + 7 + 5) / 4 = 19 / 4, or $4.75 an hour.
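The lawn-mowing arithmetic, sketched in Python both by hand and with the standard library’s statistics.mean. The dictionary is just a convenient way to hold the four rates.

```python
from statistics import mean

hourly_rates = {"Johnny": 2, "Billy": 5, "Luke": 7, "Fothersby": 5}

rates = list(hourly_rates.values())
total = sum(rates)        # 19
n = len(rates)            # 4
print(total / n)          # 4.75
print(mean(rates))        # 4.75
```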
Lies, Lies, Lies:
Some of the ways that you can lie with these statistics are relatively obvious. By switching up between talking about median, mode, or mean as the ‘average’, you can do quite a bit of fibbing and misrepresentation off the bat. The mode, median, and mean of a data set can be highly, highly unrelated to each other, which generally indicates that the data is highly skewed.
For example, let’s say that a union and a company are in a wage dispute. The workers at the company say that they make an average wage of $12 an hour. The company says that they’re lying, that the average wage is in fact $22.48 an hour. Who is telling the truth?
They both are, but they’re using different averages. With wages, a small group of high earners can distort the mean wildly, so even though wage data is technically interval/ratio, the median is often the better way to evaluate the average wage.
At this company, which has twenty-one employees, the wages paid are, in dollars per hour:
8, 8, 8, 8, 9, 9, 10, 10, 12, 12, 12, 12, 12, 13, 13, 14, 14, 40, 40, 80, 128.
The mean of these numbers is, in fact, about $22.48 (472 divided by 21). The median value is $12. Both claims are true, but a look at the data shows that the wages are skewed: seventeen of the twenty-one employees make $14 an hour or less, and a handful of very high wages drag the mean up. This is why when serious people publish stuff, they include their actual data sets, or at least the curve of that data set and the skewness/standard deviation (which we’ll get into next week).
So is the Big Bad Company being all evil and stuff by making their claim? No, not really, or rather, it depends. Say that this is a machine shop, and those wages are for people whose time is fully taken up. So if the company wanted to expand, to open up an identical machine shop, they would have to pay a meaningful average of $22.48 an hour per worker in opening this new shop.
However, from the perspective of the worker, or the perspective of income inequality at that company, the median is the more honest measurement. Or rather, the median as compared to the highest wages shows that there’s a strong disparity in wages. If the company is claiming it pays above-average wages, for example, it may be duplicitous to use the mean to back that up.
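If you want to check the dispute yourself, here is a short Python sketch using the standard library’s statistics module. Working directly from the twenty-one wages as listed, the mean comes to about $22.48 and the median to $12. The variable names are only for illustration.

```python
from statistics import mean, median

# Hourly wages for the twenty-one employees, as listed above.
wages = [8, 8, 8, 8, 9, 9, 10, 10, 12, 12, 12, 12, 12,
         13, 13, 14, 14, 40, 40, 80, 128]

mean_wage = mean(wages)
median_wage = median(wages)
print(len(wages))              # 21
print(round(mean_wage, 2))     # 22.48
print(median_wage)             # 12

# How many employees earn less than the mean?
below_mean = sum(1 for w in wages if w < mean_wage)
print(below_mean)              # 17
```

A mean that sits well above the median, with most of the people below it, is a quick sign that a few large values are doing the pulling.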
Another way that statistics and averages can be fudged is by treating some ordinal level data as interval/ratio. If you ask people to rate things on a scale with seven or more points, it is commonly accepted in statistics that you can treat that data as interval/ratio. The Scoville scale of hotness is a good example of this. It works by diluting hot stuff in sugar water and measuring how many dilutions it takes until no heat can be detected. This is really ordinal level data because it depends on a purely subjective measurement, the human taste of heat. In general, there isn’t a problem in treating this as interval/ratio, in saying that one chile is X times hotter than another, because the subjective measurement matches the population: humans are the measuring instrument, and humans are the ones who care about the result.
However, this gets very tricky when you’re analyzing stuff that’s both subjective and varies between groups. If you ask people to rate how ‘optimistic’ they are on a scale from 1 to 100, and then treat that data as interval/ratio, you’re doing something very dubious. First of all, what I see as optimistic you may see as just neutral. Second of all, cultural differences can make a huge difference. Say that people in Australia average 70 on this ‘optimism’ test, while people in China average 55. Does this indicate a real difference in optimistic outlook, or does it represent a disagreement about what the word ‘optimistic’ means?
TL;DR section:
Nominal data is for things that can’t be ranked. Ordinal data is for things that can be ranked, but where the gaps between ranks aren’t meaningful. Interval/ratio data is numerical data with meaningful differences between the numbers.
Mode is the only average available for nominal data. It is simply the most frequent response.
Median can be used for ordinal or interval/ratio data. Median is whatever response is in the middle of all the responses. If you have 1,001 responses and you rank them, the 501st response, with 500 below and 500 above, is the median response.
Mean can only be used for interval/ratio data. The mean is the sum of all the values divided by N, the number of values.
No matter what average is given to you, you should look at whether this was appropriate for the data type, and whether the raw data shows a skew.
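The rules of thumb above can be written down as a tiny lookup, sketched here in Python. The names APPROPRIATE_AVERAGES and is_appropriate are made up for this example, and the table only encodes the summary in this post, not any official standard.

```python
# Which averages this post treats as meaningful for each data type.
APPROPRIATE_AVERAGES = {
    "nominal": {"mode"},
    "ordinal": {"mode", "median"},
    "interval/ratio": {"mode", "median", "mean"},
}

def is_appropriate(average, data_type):
    """Return True if the average makes sense for the data type."""
    return average in APPROPRIATE_AVERAGES[data_type]

print(is_appropriate("mean", "ordinal"))      # False
print(is_appropriate("median", "ordinal"))    # True
print(is_appropriate("mode", "nominal"))      # True
```

Passing this check only tells you the average is allowed for the data type. You still have to look at the raw data for skew.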
In my next post, I’ll be going over the normal curve and standard deviation, which give a standardized way to examine the skewness and distribution of interval/ratio data.